Abstract

The inferior temporal cortex (IT) of monkeys is thought to play an essential role in visual object recogni- 
tion. Inferotemporal neurons are known to respond to complex visual stimuli, including patterns like faces, 
hands, or other body parts. What is the role of such neurons in object recognition? The present study ex- 
amines this question in combined psychophysical and electrophysiological experiments, in which monkeys 
learned to classify and recognize novel visual 3D objects. A population of neurons in IT was found to
respond selectively to such objects that the monkeys had recently learned to recognize. A large majority 
of these cells discharged maximally for one view of the object, while their response fell off gradually as the 
object was rotated away from the neuron's preferred view. Most neurons exhibited orientation-dependent 
responses also during view-plane rotations. Some neurons were found tuned around two views of the 
same object, while a very small number of cells responded in a view-invariant manner. For five different 
objects that were extensively used during the training of the animals, and for which behavioral perfor- 
mance became view-independent, multiple cells were found that were tuned around different views of the 
same object. No selective responses were ever encountered for views that the animal systematically failed 
to recognize. The results of our experiments suggest that neurons in this area can develop a complex 
receptive field organization as a consequence of extensive training in the discrimination and recognition of 
objects. Simple geometric features did not appear to account for the neurons' selective responses. These 
findings support the idea that a population of neurons - each tuned to a different object aspect, and each 
showing a certain degree of invariance to image transformations - may, as an assembly, encode complex 
3D objects. In such a system, several neurons may be active for any given vantage point, with a single 
unit acting like a blurred template for a limited neighborhood of a single view. 
1 Introduction

Object recognition can be thought of as the process of 
matching the image of an object to its representation 
stored in memory. Because different viewing, illumina- 
tion, and context conditions generate different retinal 
images, the nature of the stored representation and the 
process of normalization of the sensory input present
one of the greatest challenges to understanding biolog- 
ical recognition. It is well known that familiar objects 
are recognized regardless of viewing angle, scale or po- 
sition in the visual field. How is such perceptual object 
constancy accomplished? Does the brain transform the 
sensory or the stored representation to discard the image 
variability resulting from different viewing conditions, or 
does generalization occur as a consequence of perceptual 
learning, that is, of being acquainted with different in- 
stances of any given object? The present paper addresses 
one aspect of this issue, namely, how the primate recog- 
nition system may compensate for changes in viewing 
angle and distance, ignoring the image changes resulting 
from variation of the illumination and context. More- 
over, the issue is addressed at the level of subordinate 
categorizations of objects. 

Studies indicate that objects can be identified at a 
number of levels of abstraction, but are most easily rec- 
ognized at what is referred to as the basic level (Rosch 
et al., 1976). For instance, a barn swallow is perceived 
first as a bird, rather than as a swallow or an Avian.
Classifications above the basic level are more general
and are called superordinate. In contrast, subordinate
level refers to classifications below the basic level that
are more specific, sharing a great number of attributes
with other members of the object class. The behavioral 
performance of humans for subordinate classifications is 
strongly view dependent (Rock and DiVita, 1987; Tarr 
and Pinker, 1990; Edelman and Bülthoff, 1992), presumably
because it largely relies on the recognition of
subtle differences in the shape of complex objects. It 
is also this type of classification that is most seriously 
impaired by circumscribed damage to the human cere- 
bral cortex (Damasio, 1990). It appears that, at least in 
humans, distinct shape differences may be the basis for 
reliable object recognition under any viewing conditions. 
Objects with distinct shape are recognized most easily and
quickly, whether at the basic level or not. For instance, a
penguin, i.e. an atypical exemplar of the basic-level category
birds, is most likely to be first recognized as "penguin"
rather than as a "bird", a classification termed entry
level recognition (Jolicoeur et al., 1984). Penguins do
indeed have a distinct shape when compared with most 
other animals, but also differ a great deal from any other 
bird. 

Conceptual hierarchies like those mentioned above re- 
flect certain types of interactions between the human 
perceiver and objects in the environment. As such they 
also reflect the "default" probabilities of the required 
discriminations for any given class of objects. Thus in a 
domain of expertise, subordinate-level categories may be
as differentiated as the basic-level categories, and the for- 
mer categorizations may be as fast as the latter (Tanaka 
and Taylor, 1991). Clearly, in the nonhuman primate,
categories have no bearing on language. Nonetheless,
there is little doubt that monkeys are capable of cate- 
gorizations of objects like predators, prey, infant mon- 
keys, or food; categories of objects usually having distinct 
shape differences. It has also been shown that monkeys 
can be trained to be "experts" in discriminations of ob- 
jects of a novel class, the members of which share great 
shape similarities (Logothetis et al., 1994). It is this lat- 
ter type of object discriminations that was used to study 
the spatial reference system of object representations in 
the non-human primate and the activity of neurons in 
the temporal cortex during the execution of the recogni- 
tion task. 

The reference system used in matching object shapes 
to their representations encoded in visual memory is a 
key question in the research of visual object recognition 
(Farah, 1985; Ullman, 1989; Tarr and Pinker, 1989). 
Theories relying on object-centered representations as- 
sume either a complete three-dimensional description 
of an object (Ullman, 1989), or a structural descrip- 
tion of the image that specifies the relationships among 
viewpoint-invariant volumetric primitives (Marr, 1982; 
Biederman, 1987). Whereas such theories correctly pre- 
dict the view-independent recognition of familiar objects 
(Biederman, 1987), they fail to account for performance 
in recognition tasks with novel objects at the subordinate
level (Rock & DiVita, 1987; Rock et al., 1981; Tarr
& Pinker, 1990; Bülthoff and Edelman, 1992; Edelman
& Bülthoff, 1992). Viewpoint-dependent, image-based
models, on the other hand, represent three-dimensional 
objects as a set of 2D views, or aspects, and recognition 
consists of matching image features against the views in 
this set. 

Although such models can account for the perfor- 
mance of human subjects in any recognition task, they 
are usually considered implausible because of the mem- 
ory a system would require to store all discriminable 
views of many objects. These objections, however, have 
recently been challenged by computer simulations show- 
ing that a simple network can recognize 3D objects by 
interpolating between a small number of stored views 
(Poggio and Edelman, 1990; Logothetis et al., 1994). 
This network (Figure 1) uses a small set of sparse data, 
corresponding to an object's training views, to synthe- 
size an approximation of a multivariate function (Poggio 
and Girosi, 1990) representing the object. 
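
A minimal sketch of such a view-interpolating network is given here for illustration. It assumes that each view is summarized by a hypothetical feature vector (the projected 2D vertex positions of a random seven-vertex wire object), that the weights are chosen so the output equals 1 at every training view, and that the Gaussian width `sigma` is a free parameter. None of these choices is taken from the cited simulations.

```python
import numpy as np

def rotate_y(points, deg):
    """Rotate an (n, 3) array of points about the vertical (Y) axis."""
    t = np.radians(deg)
    rot = np.array([[np.cos(t), 0.0, np.sin(t)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(t), 0.0, np.cos(t)]])
    return points @ rot.T

def view_features(points, deg):
    """Assumed feature vector: x, y image coordinates of all vertices."""
    return rotate_y(points, deg)[:, :2].ravel()

def rbf_activations(v, templates, sigma):
    """Gaussian of the Euclidean distance between view v and each stored template."""
    d2 = np.sum((templates - v) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_network(templates, sigma):
    """Solve for weights c such that the network outputs 1 at each training view."""
    G = np.array([rbf_activations(t, templates, sigma) for t in templates])
    return np.linalg.solve(G, np.ones(len(templates)))

def network_output(v, templates, c, sigma):
    """Weighted sum of the hidden-unit activations."""
    return float(c @ rbf_activations(v, templates, sigma))

rng = np.random.default_rng(0)
target = rng.uniform(-1.0, 1.0, size=(7, 3))   # hypothetical wire vertices
train_angles = [0, 60, 120, 180]               # training views, in degrees
templates = np.array([view_features(target, a) for a in train_angles])
sigma = 1.5                                    # illustrative width, not from the paper
c = train_network(templates, sigma)
```

Evaluating `network_output(view_features(target, a), templates, c, sigma)` for angles `a` between and beyond the training views traces out the interpolation behavior of the sketch; a yes/no decision would follow from thresholding this output.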

In such a network a view can be represented by a set 
of any image features, such as the orientations or po- 
sitions of object parts, shape metrics, texture, or color. 
Complex features can be created hierarchically from sim- 
pler ones as shown in Figure 1. The performance of the 
network was tested with geometrical features like the po- 
sition of the vertices of wire-objects (Poggio & Edelman, 
1990), or their orientations (Logothetis et al., 1994), or 
with features extracted from real images of wire-objects 
(Brunelli and Poggio, 1991b) or faces (Brunelli and Pog- 
gio, 1991a). The actual features used by a biologi- 
cal recognition system are presently unknown and their 
nature is an important experimental question per se. 
Nonetheless, some of the arbitrary features used in the 
simulations can provide a measure of object similarity. 
Figure 1: (a) Performance of a regularization network trained with the 0°, 60°, 120°, and 180° views of a wire-object.
Each "hidden-layer" unit takes a similarity measure between a novel view and a template stored in the
unit's memory, by calculating the Euclidean distance $\|V - T_i\|$ of the input vector $V$ from its learned view $T_i$, and
subsequently computing the function $\exp(-\|V - T_i\|^2)$ of this distance. The activity of the entire network
is conceived of as the weighted sum of each unit's output, $F(V) = \sum_{i=1}^{N} c_i \exp(-\|V - T_i\|^2)$. A decision criterion
can be applied for yes/no type of performance. The basic scheme can be hierarchically used for composing complex
features out of simpler ones (small inset).



Based on such features, simple simulations argue against 
the implausibility of a view-based recognition system. 

Also in agreement with the basic idea that a lim- 
ited number of views might be sufficient to accomplish 
view-invariance are recent psychophysical experiments
showing that human subordinate-level recognition performance
can be best predicted by assuming that subjects
interpolate between familiar object views (Bülthoff
& Edelman, 1992; Edelman & Bülthoff, 1992). Similar
performance has been observed in nonhuman primates
performing a subordinate level recognition task (Logothetis
et al., 1994). It was shown that monkeys were
limited in their ability to generalize recognition to novel
views of an object, performing best for a most familiar 
view and gradually worse for views with increasing dis- 
tance from the known view. Familiarity with two views 
of an object allowed the interpolation of recognition be- 
tween the views if they were close enough together, say 
75° apart, but resulted in two independent regions of 
generalization if they were far apart, say 160°. In most 
cases, however, only three to five familiar views were 
needed for the animal to achieve view-invariant perfor- 
mance around one axis. 

A recognition architecture that could underlie such 
performance might rely on small-scale networks with 
units that are broadly tuned to views or features of a 
learned object. Neurons responding to complex 2D pat- 
terns, including face or hand views (Gross et al., 1972; 
Bruce et al., 1981; Rolls, 1984; Desimone et al., 1984; 
Yamane et al., 1988), have indeed been reported in infer- 
otemporal cortex of the monkey by different researchers 
(Richmond et al., 1987; Miyashita, 1988; Tanaka et al.,
1991; Fujita et al., 1992). Such cells discharge more
strongly to complex patterns than to any simple stimu- 
lus, and are found even in the earliest stages of ontogeny 
of the primate (Rodman et al., 1993). A detailed inves- 
tigation of the cells showing high selectivity for faces has 
revealed several different types or classes of neurons in 
the superior temporal sulcus, each broadly tuned to one 
view of the head, e.g. full face or profile (Perrett, 1985). 
Similarly, neurons have been reported that respond selec- 
tively to static or dynamic information about the body, 
or body parts, some of which were dependent on the ob- 
server's vantage point (Perrett et al., 1989; Wachsmuth 
et al., 1994). Is such a configurational selectivity specific 
only for faces or body parts, or can it be generated for 
any novel object as a result of extensive training? 

Clinical observations have shown that the recognition 
of living things can be selectively impaired (Farah et al., 
1991). This may imply that the perception of faces or bi- 
ological forms in general is mediated by specialized neu- 
ral populations. If so, then the complex-pattern selec- 
tivity (faces, body parts, etc.) reported in the above 
studies may be unique to the representation of the class 
of "living things", with different encoding mechanisms
responsible for the recognition of other objects. In gen- 
eral, objects may be represented by large populations 
of cells each encoding a simple feature, or the conjunc- 
tion of simple features that are characteristic for a given 
class. Alternatively, a system based on neurons selec- 
tive for complex configurations may provide one mecha- 
nism for encoding any object that cannot undergo much 
meaningful decomposition in the course of recognition. 
Some subordinate categorizations cannot rely on part
decomposition. We are unlikely to recognize individual
faces, for example, by simply detecting the existence of 
two eyes, the nose and the mouth, as each individual 
is likely to have the same parts in approximately the 
same positions. It is a holistic and/or a metric repre- 
sentation that probably underlies the recognition of a 
person's face. The same reasoning may apply for the 
recognition of individual objects of other classes, partic- 
ularly artificial objects composed of similar parts. Thus, 
the question arises: If monkeys are extensively trained 
to identify novel 3D objects of a class whose members 
show a great deal of structural similarity, then would 
one find neurons in the brain which respond selectively 
to the views of such objects? 

We have examined this possibility using two classes 
of novel, computer-rendered stimuli: Gouraud-shaded 
wire-like and amoeboid objects (Bülthoff & Edelman,
1992; Edelman & Bülthoff, 1992; Logothetis et al., 1994).
The monkeys were trained in a matching task, general- 
ized across translation, scaling and orientation changes. 
Within an object class the target-distractor similarity 
varied between one extreme, where distractors were gen- 
erated by randomly selecting shape-parameters, such as 
the positions of vertices or protrusions, the sharpness of 
angles between segments, or the moment of inertia of the 
objects, and the other extreme, where distractors were 
generated by adding different degrees of noise to the pa- 
rameters of the target. A variety of other digitized 2D 
or 3D patterns, e.g., geometric objects, scenes, and body
parts, were also used as controls in the physiological
experiments.

2 Methods 

2.1 Subjects and Surgical Procedures 

Two juvenile rhesus monkeys (Macaca mulatta) weighing
7-9 kg were tested in the electrophysiological studies.
The animals were cared for in accordance with the Na- 
tional Institutes of Health Guide, and the guidelines of 
the Animal Protocol Review Committee of the Baylor 
College of Medicine. 

After preliminary training, the animal underwent
aseptic surgery, using isoflurane anesthesia (1.2% -
1.5%), for the placement of the head restraint post and
the scleral search eye coil. Throughout the surgical pro- 
cedure the heart rate, blood pressure and respiration 
were monitored constantly and recorded every 15 min- 
utes. Body temperature was kept at 37 degrees using a 
heating pad. Postoperatively, the monkey was adminis- 
tered an opioid analgesic (Buprenorphine hydrochloride 
0.02 mg/kg, IM) every 6 hours for one day, and Tylenol 
(10 mg/kg) and antibiotics (Tribrissen 30 mg/kg) for 
3-5 days. At the end of the training period another ster- 
ile surgery was performed to implant a chamber for the 
electrophysiological recordings. 

2.2 Animal Training 

Standard operant conditioning techniques with positive 
reinforcement were used to train the monkey to perform 
the task. Initially, the animals were trained to recognize 
a target's zero view among a large set of distractors.
When they had learned the zero view they were encour-
aged to generalize recognition to neighboring views re- 
sulting from progressively larger rotations around one 
axis. The criterion required before training with another 
object was 95% correct over a range of ±90° for the tar- 
get, and less than 5% false alarm rate for all distractors. 
In the early stages of training several days were required 
to train the animals to perform the same task for a new 
object. Four months of training was required on average 
for the monkey to learn to generalize the task across dif- 
ferent types of objects of one class, and about six months 
were required for the animal to generalize for different 
object classes. 

The similarity of the targets to the distractors was 
gradually increased within an object class. In the fi- 
nal stage of the experiments distractor wire-objects were 
generated by adding different degrees of position or ori- 
entation noise to the target objects. A criterion of 95% 
correct for several objects was required to proceed with 
the psychophysical data collection. 

In the initial training phase, the animal received con- 
tinuous feedback about its performance. Each correct 
response was rewarded with a drop of juice. In the later 
stages of the training the animals were reinforced on a 
variable-ratio schedule which administered a reward af- 
ter a specified average number of correct responses had 
been given. Finally, in the last stage of the behavioral 
training the monkey was rewarded only after ten con- 
secutive correct responses. The end of the observation 
period was signalled with a full-screen, green light and a 
juice reward for the monkey. The variable-ratio schedule 
was also used throughout the period of psychophysical 
data collection. 

During the behavioral training, independent of the re- 
inforcement schedule, the monkey always received feed- 
back as to the correctness of each response. Incorrect 
reports aborted the entire observation period. During 
psychophysical data collection, on the other hand, the 
monkey was presented with novel objects and no feed- 
back was given during the testing period. The behavior 
of the animals was monitored continuously during the 
data collection by computing on-line hit rate and false 
alarms. Arbitrary performance or the development of 
hand-preferences, e.g. giving only right hand responses, 
was discouraged during psychophysical data collection 
by randomly interleaving sessions of actual data collec- 
tion with sessions in which a novel object was presented 
but correct performance was required of the animal (i.e., 
incorrect responses resulted in aborts). 

In the electrophysiological experiments the animal 
was required to maintain fixation throughout the en- 
tire observation period. Eye movements were measured 
using the scleral search coil technique and digitized at 
200 Hz.

2.3 Electrophysiological recording 

Recording of single unit activity was done using 
Platinum-Iridium electrodes of 2-3 Megohms impedance. 
The electrodes were advanced into the brain through 
a guide tube mounted into a ball-and-socket positioner 
(Monkey S5396: AP = 15, L = 22; Monkey B63A:
AP = 19, L = 22). By swivelling the guide tube, different
sites could be accessed within an approximately
10 x 10 mm cortical region. Action potentials were amplified
(Bak Electronics, Model 1A-B), and routed to an
audio-monitor (Grass AM-8) and to a time-amplitude
window discriminator (Bak Model DIS-1). The output
of the window discriminator was used to trigger the real-time
clock interface of a PDP11/83 computer.

Figure 2: The experimental paradigm. Each observation period began with the presentation of a fixation spot.
Successful fixation was followed by the learning phase, after which up to ten single, static views of either the target
or a distractor were presented sequentially (testing phase). The subject was required to respond to each one in turn,
indicating a choice of "target" by pressing the right lever or "distractor" by pressing the left lever. Fixation was
maintained for the duration of the observation period.

2.4 Visual stimuli 

The visual objects were presented on a monitor situated 
97 cm from the animal. The selection of the vertices of 
the wire objects within a three-dimensional space was 
constrained to exclude intersection of the wire-segments 
and extremely sharp angles between successive segments, 
and to ensure that the difference in the moment of in- 
ertia between different wires remained within a limit of 
10%. Once the vertices were selected the wire objects 
were generated by determining a set of rectangular facets 
covering the surface of a hypothetical tube of a given ra- 
dius that joined successive vertices. 
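
The vertex-selection constraints can be illustrated with a small rejection-sampling sketch. Several details are assumptions made only for this example: seven vertices per object, a 30° threshold for an "extremely sharp" angle, and a moment of inertia computed by treating the vertices as unit point masses about their centroid. The check for intersecting wire segments is omitted for brevity.

```python
import numpy as np

def min_joint_angle(vertices):
    """Smallest interior angle (degrees) between successive wire segments."""
    seg = np.diff(vertices, axis=0)
    a, b = -seg[:-1], seg[1:]          # from each joint to its previous / next vertex
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).min())

def moment_of_inertia(vertices):
    """Assumed proxy: sum of squared vertex distances from the centroid."""
    centered = vertices - vertices.mean(axis=0)
    return float(np.sum(centered ** 2))

def random_wire(rng, n_vertices=7, min_angle=30.0,
                target_inertia=None, tol=0.10, max_tries=10000):
    """Draw vertices until the angle and (optional) inertia constraints hold."""
    for _ in range(max_tries):
        v = rng.uniform(-1.0, 1.0, size=(n_vertices, 3))
        if min_joint_angle(v) < min_angle:
            continue                    # reject extremely sharp angles
        if target_inertia is not None:
            if abs(moment_of_inertia(v) - target_inertia) > tol * target_inertia:
                continue                # keep inertia within the 10% limit
        return v
    raise RuntimeError("no wire satisfying the constraints was found")

rng = np.random.default_rng(1)
reference = random_wire(rng)
distractor = random_wire(rng, target_inertia=moment_of_inertia(reference))
```

Here `reference` plays the role of a target and `distractor` is a second wire whose inertia proxy stays within 10% of it.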

The spheroidal objects were created through the gen- 
eration of a recursively-subdivided triangle mesh ap- 
proximating a sphere. Protrusions were generated by 
randomly selecting a point on the sphere's surface and 
stretching it outward. Smoothness was accomplished by 
increasing the number of triangles forming the polyhe- 
dron that represents one protrusion. Spheroidal stimuli 
were characterized by the number, sign (negative sign 
corresponded to dimples), size, density and sigma of 
the gaussian type protrusions. Similarity was varied by 
changing these parameters as well as the overall size of 
the sphere. 

Test-views were typically generated by ±10 to ±180
degree rotations around the vertical (Y), horizontal (X),
or the two oblique (±45°) axes lying on the XY plane. 

2.5 Data Analysis 

Mean spike rates are distributed symmetrically, that is,
the mean is an accurate representation of central tendency,
coinciding with the median of the distribution.
The significance of differences between mean spike rates
measured during the target presentations and those measured
during the distractor presentations can therefore
be tested by using the non-parametric Walsh test for
two related samples (Walsh, 1949). For our sample
size (N = 9 presentations per target-view or distractor),
the power-efficiency, i.e. approximately the percentage
of the total available information per observation
which is utilized by the test, of the one-tailed
Walsh test at $\alpha = 0.011$ is 98% of that of the parametric
t test at $\alpha = 0.05$, while it avoids the use
of assumption-laden dispersion measures. The neurons
presented here as view-selective gave equal or greater
responses to target views than to the views of the distractors,
at $\alpha = 0.011$ ($\min[d_3, \tfrac{1}{2}(d_1 + d_5)] > 0$).
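
A minimal sketch of this decision rule is shown below. It assumes that the criterion reads min[d3, (d1 + d5)/2] > 0, that the nine paired differences (target minus distractor) are ordered from smallest to largest, as is usual for the Walsh test, and that responses are paired presentation by presentation. The function name and the pairing are illustrative.

```python
def walsh_one_tailed_n9(target_rates, distractor_rates):
    """Return True if min[d3, (d1 + d5) / 2] > 0 for nine paired differences.

    Differences are target minus distractor, sorted in ascending order,
    so d1 is the smallest difference (assumed convention).
    """
    if len(target_rates) != 9 or len(distractor_rates) != 9:
        raise ValueError("this criterion applies to N = 9 pairs only")
    d = sorted(t - x for t, x in zip(target_rates, distractor_rates))
    d1, d3, d5 = d[0], d[2], d[4]
    return min(d3, 0.5 * (d1 + d5)) > 0

# Hypothetical spike rates (spikes/s), for illustration only.
target = [22, 25, 19, 30, 27, 24, 21, 26, 28]
distractor = [5, 7, 4, 6, 8, 5, 6, 7, 4]
significant = walsh_one_tailed_n9(target, distractor)
```

Because the rule depends only on the ordered differences, a single strongly negative difference can already prevent rejection of the null hypothesis.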

3 Results 

3.1 View selectivity 

Figure 2 describes the sequence of events that composes 
a single observation period. An observation period be- 
gan with the presentation of a small fixation spot. Suc- 
cessful fixation was followed by the learning phase, dur- 
ing which the target was presented for 2 to 4 seconds 
from one viewpoint. This view of the target, called the 
training view, was presented in oscillatory motion ±15° 
around a fixed axis at 0.67 Hz to provide the subject with
complete 3D structure information. The learning phase
was followed by a short fixation period after which the 
testing phase started. A testing phase consisted of up 
to 10 sequential trials, in each of which the test stim- 
ulus, a static view of either the target or a distractor, 
was presented. Thirty target views 12° apart and 60 to 
120 distractors were tested in a given session. The dura- 
tion of stimulus presentation was 500-800 msec, and the 
monkeys were given 1500 msec to respond by pressing 
one of two levers: the right lever upon presentation of 
a target view and the left upon presentation of a dis- 
tractor. Typical reaction times were below 1000 msec 
for both animals. An experimental session consisted of 
a sequence of 60 observation periods, each lasting about 
25 seconds. 

A total of 970 IT cells were recorded from two mon- 
keys during combined psychophysical and electrophys- 
iological experiments, in which the subject performed 
either a fixation task, or the recognition task described 
above. All data barring those shown in the last figure 
were collected using objects that the monkeys could rec- 
ognize from any viewpoint (hit rate above 95% for all 
views, and false alarm below 5% for all distractors). The 
animals' view-invariant performance in the case of these 
objects was a result of training on multiple views, which 
led to generalization around an entire axis, and of eventually
giving feedback for all views. A large majority of
the isolated neurons were visually active when plotted 
with a variety of simple or complex stimuli, including 
some of the wire or spheroidal objects. Other neurons 
were inhibited by the presentation of target objects, and 
a small fraction of cells were inhibited by any stimulus 
including the fixation spot. 

A number of units, however, responded selectively to a 
subset of views of one of the known target objects, firing 
much less or not at all for the distractors. The response 
of these neurons for different views was approximated by 
fitting to the data a gaussian function centered on the 
view eliciting the greatest response. If a cell responded 
to two subsets of views, as was the case for several cells, 
the linear sum of two gaussian functions, one centered on 
each "most effective" view, was used to fit the response. 
The standard deviation of these functions, which can be 
viewed as a measure of the generalization field of the cell, 
was used to classify the neurons based on the following 
criterion. Cells (N = 61) were considered selective if they 
responded significantly more to target views within two 
standard deviations of the preferred view, than for any 
of the distractors (see Methods).
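
The single-Gaussian fit can be sketched as follows. This is an illustrative procedure, not the fitting routine used for the data: the centre is fixed at the view with the largest response, the width is found by a grid search over candidate standard deviations, and amplitude and baseline are obtained by linear least squares. The grid range and the wrapping of angles to ±180° are assumptions.

```python
import numpy as np

def angular_diff(a, b):
    """Signed difference a - b wrapped to [-180, 180) degrees."""
    return (np.asarray(a, dtype=float) - b + 180.0) % 360.0 - 180.0

def fit_view_tuning(angles_deg, rates, sigmas=np.arange(5.0, 91.0, 1.0)):
    """Fit rate = amplitude * exp(-d^2 / (2 sigma^2)) + baseline."""
    angles_deg = np.asarray(angles_deg, dtype=float)
    rates = np.asarray(rates, dtype=float)
    center = angles_deg[np.argmax(rates)]      # view eliciting the greatest response
    d = angular_diff(angles_deg, center)
    best = None
    for s in sigmas:
        g = np.exp(-d ** 2 / (2.0 * s ** 2))
        A = np.column_stack([g, np.ones_like(g)])
        coef, *_ = np.linalg.lstsq(A, rates, rcond=None)
        sse = float(np.sum((A @ coef - rates) ** 2))
        if best is None or sse < best[0]:
            best = (sse, s, coef[0], coef[1])
    sse, sigma, amp, base = best
    ss_tot = float(np.sum((rates - rates.mean()) ** 2))
    r2 = 1.0 - sse / ss_tot if ss_tot > 0 else float("nan")
    return {"center": float(center), "sigma": float(sigma),
            "amplitude": float(amp), "baseline": float(base), "r2": r2}

# Synthetic tuning curve: 30 views 12 degrees apart, peak at 24 degrees.
angles = np.arange(-180, 180, 12)
rates = 5.0 + 40.0 * np.exp(-angular_diff(angles, 24.0) ** 2 / (2.0 * 30.0 ** 2))
fit = fit_view_tuning(angles, rates)
```

The returned `sigma` corresponds to the generalization field described above, and `r2` is a coefficient of determination for the fit. A cell tuned to two views would need a second Gaussian term, which this sketch does not include.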

An example of a view-selective neuron is shown in Fig- 
ure 3a. The cell's firing rate reached a maximum upon 
presentation of one particular object view and declined 
as the object was rotated away from this preferred view.
Figure 3b shows sixteen out of the 60 tested distractor 
wire-objects and an associated histogram of the response 
each elicited. The within-class recognition task the an- 
imal was performing during the electrophysiological ex- 
periments provided an internal control against common 
or trivial features being responsible for the behavior of 
the neurons. Examination of the views of the target 
for which the cell is selective reveals a couple of features
that may be characteristic for that view of the target.
For example, the inverted "V" (circled) in the 0° view in
Figure 3a appears to be a prominent feature that all the
response-eliciting target views have in common. Could
the neuron simply be selectively firing for the presence
of this particular feature? This is not likely to be the 
case as an inverted "V" is also present in several of the 
distractors (see the circled regions of distractors 18, 25, 
44, 49, 50 in Figure 3b). 

Similar results were obtained with the class of 
spheroidal objects (Figure 4). Here, too, the neuron re- 
sponds maximally to one view of the object, 72° away 
from the zero-view, with its response declining as the 
angle of rotation deviates in either direction from the 
preferred view. Figure 4b shows the "best-response" 
eliciting distractors. Although all views of the target 
have one particular protrusion which remains visible in 
all views, this alone does not seem to be sufficient to 
elicit any sort of response. As indicated by the circled 
region of view "72°", all of the views eliciting a significant
response share the presence of a "face-like" region
containing two dimples and a small protrusion in the 
lower right. However, similar regions are also present in 
two of the distractors, 12 and 14 in the bottom half of 
the figure, and neither of these elicits any activity from
the cell whatsoever. 

The generalization field of a number of view-selective 
neurons was examined for all rotations in depth using 
views neighboring the preferred view along all four axes. 
An example is shown in Figure 5a. This cell responded 
best to the 0° view of the object and its response mag- 
nitude decreased with increasing angle of rotation along 
all axes. A small percentage of the view-selective cells 
(5 out of 61) exhibited their maximum discharge rate for 
two views 180 degrees apart (Figure 5b). The same pat- 
tern was observed in the behavioral performance of the 
monkeys for several objects (Logothetis et al., 1994). In 
both cases, this type of response was specific to wire-like 
objects whose zero and 180° views appeared as mirror- 
symmetrical images of each other, due to accidental min- 
imal self-occlusion. 

Figure 6 shows the distribution of the generalization 
fields of view-selective cells for the wire-like and the 
spheroidal objects. The insets show the coefficients of de- 
termination indicating the goodness of fit. Both object- 
types gave similar tuning width, which was always less 
than or equal to the behavioral generalization field of 
monkeys trained with one view of similar objects (Logo- 
thetis et al., 1994). 

A number of the objects used extensively during the 
training of the animal were also used during the electro- 
physiology sessions. For several of these objects, multi- 
ple neurons were found that were selective for different 
views of the same object. Figure 7a through 7d illus- 
trates such a case for four units. Three out of the 970 
cells responded selectively to specific objects presented 
from any viewpoint. Figure 7e shows such a neuron that 
appears to have properties of object-centered descriptions.
The cell responds about equally well for all target views 
and significantly less to any of the 120 distractors. 
Figure 3: View-selective response of an IT neuron for a wire-like object. Peristimulus histograms (PSTHs) show the
activity of a view-selective neuron when (a) the target or (b) distractors were presented. The ordinate and abscissa,
labeled in the lower left, are the same for both the upper and lower sets of histograms. The insets show the target
and the distractor views. The boxed plot is the zero view, presented in the learning phase. Note that the activity of
the neuron for a given target view is well above that for distractors up to ±36° from the preferred view, defining the
generalization field of the neuron. The dashed circles in the upper half (0° view) and in the lower half (distractors
18, 25, 44, 49, 50) of the figure serve to highlight one of the features, an inverted "V", which all of these images have
in common (see text).
Figure 4: View-selective response of a neuron for a spheroidal object. Conventions as in Figure 3.
Figure 5: (a) Response of a view-selective neuron to rotations around the preferred view along four axes. The
z-dimension of the plot is spike rate and the x and y dimensions show the degrees of rotation of the target object
along either or both of these axes. The volume was generated by testing the cell's response for rotations out to ±60°
around the x and y axes as well as along the two diagonals. The magnitude of response fell off by about the same amount for
rotations away from 0° along all of the axes tested. The activity of the neuron for the 60 distractors is shown in
the inset. (b) Response of a neuron selective for pseudo-mirror-symmetric views, 180° apart, of a wire-like object.
The filled circles are the mean spike rates for target views around one axis of rotation. The solid black line is a
DWLS-smoothed view-tuning curve. The two inset images depict the -120° and 60° views around both of which the
neuron showed view-selective tuning. The activity of the neuron for the 60 different distractor objects used during
testing is shown in the inset gray box.
Figure 6: Distribution of the standard deviation of the Gaussians fitted to the view-tuning curves of IT neurons for 
The gray bars show the three units that responded in a view-invariant manner for a given object. The insets show 
Figure 7: (a) - (d) View-selective responses of neurons tuned to different views of the same wire-object. All data 
come from the same animal (S5396). The filled circles are the mean spike rates (N=10), and the thin black lines 
DWLS-smoothed view-tuning curves. The thick gray lines are a nonlinear approximation of the data (QNMT) with 
the function $R(\theta) = \sum_{i=1}^{N} c_i \exp\left(-\|\theta - \theta_i\|^2 / 2\sigma_i^2\right) + R_0$, where N = 1 or 2. (e) An example of a neuron showing 
view-invariant response for a known wire object. The behavioral performance of the monkey for this object was 
view-independent due to its having been used as a training object (see text). The insets in (a) through (e) show the 
activity of the neuron for the 60 or 120 distractors used during testing. 
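The tuning function named in the Figure 7 caption, a sum of N Gaussians in view angle plus a baseline rate, can be evaluated with the short sketch below. It only evaluates the model; it does not reproduce the fitting procedure. The parameter values are hypothetical, and wrapping angular differences into [-180°, 180°) is an assumption made here so that views on either side of ±180° are treated as neighbours.

```python
import math


def angular_difference(a_deg, b_deg):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a_deg - b_deg + 180.0) % 360.0 - 180.0


def view_tuning(theta_deg, components, baseline):
    """Evaluate R(theta) = sum_i c_i * exp(-d_i**2 / (2 * sigma_i**2)) + R_0.

    components: list of (c_i, theta_i, sigma_i) tuples, one per tuning peak
                (one tuple for N = 1, two tuples for N = 2).
    baseline:   R_0, the view-independent part of the firing rate.
    """
    rate = baseline
    for amplitude, center_deg, sigma_deg in components:
        d = angular_difference(theta_deg, center_deg)
        rate += amplitude * math.exp(-(d * d) / (2.0 * sigma_deg * sigma_deg))
    return rate


# Hypothetical parameters, not fitted to any data in the paper.
single_peak = [(30.0, 0.0, 30.0)]                          # N = 1
double_peak = [(25.0, -120.0, 25.0), (20.0, 60.0, 25.0)]   # N = 2, peaks 180 deg apart

rate_at_preferred = view_tuning(0.0, single_peak, baseline=5.0)
rate_at_opposite = view_tuning(180.0, single_peak, baseline=5.0)
```

With N = 2 and peaks 180° apart, the same function describes a cell tuned to two pseudo-mirror-symmetric views of one object.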
3.2 Translation and scale invariance 

Among the population of neurons examined, we could 
identify a number of units that showed a large degree of 
size invariance. Figure 8 is an example of a view-selective 
neuron whose response was found to be invariant 
to changes in size. Whether the stimulus subtended one 
degree of visual angle or six degrees, the magnitude of the 
cell's response was the same. Note that the fixation spot, 
the only unchanging part of the stimulus, did not elicit a 
response from the cell during the first 500ms of the trial 
before the stimulus onset. Figure 9 shows the response 
of the same cell when tested for positional invariance. In 
this case the center of the stimulus was translated 7.5 
degrees from the fixation spot. With the exception of 
the brief on-transient, the cell's activity does not deviate 
from baseline at any of the tested positions. Thus, this cell, 
while scale invariant, appears to be position dependent 
for relatively large displacements. The responses shown 
in Figures 8 and 9 were collected during a simple fixation 
task. 

The responses of eight view-selective neurons were 
tested for scale and translation invariance in the context 
of the object recognition task using the preferred view 
of the object. The stimulus sizes used subtended from 
1.9 to 5.6 degrees of visual angle, and the positions were 
tested all at a radial distance of 3.15 degrees. An exam- 
ple of a view-selective neuron responding invariantly to 
changes in both size and position is shown in Figure 10. 

This particular cell was selective when a limited re- 
gion of the object around 120 degrees (Figure 10a) was 
presented, and responded 3.5 times more for the pre- 
ferred target view than for the best distractor (Figure 
10b). Responses to scaling and translation were tested 
using the preferred view. Figure 10c shows the ratio 
of the target response to the mean response for the ten 
best distractors for the sizes tested. Note that all of the 
distractors were of the default size and were presented 
foveally. The responses of the same cell to translation 
are plotted in Figure 10d. This particular neuron showed 
some variance in its response depending on stimulus position; 
however, in all cases its response for an eccentrically 
presented target was still at least twice that for 
foveally presented distractors. Seventy-five percent of 
the tested neurons gave only scale-invariant responses 
while 35% were invariant for both scale and position. 
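The selectivity measure plotted in Figures 10c and 10d, the target response divided by the mean response to the ten best distractors, can be written as a minimal sketch. The function name and the example rates are hypothetical; only the definition of the ratio comes from the text.

```python
def target_to_distractor_ratio(target_rate, distractor_rates, n_best=10):
    """Ratio of the target response to the mean of the n_best strongest
    distractor responses (all values are mean spike rates)."""
    best = sorted(distractor_rates, reverse=True)[:n_best]
    if not best:
        raise ValueError("at least one distractor response is required")
    mean_best = sum(best) / len(best)
    if mean_best <= 0:
        raise ValueError("mean distractor response must be positive")
    return target_rate / mean_best


# Hypothetical rates: 60 distractors, ten of which drive the cell at 10 spikes/s.
example_distractors = [10.0] * 10 + [2.0] * 50
example_ratio = target_to_distractor_ratio(40.0, example_distractors)
```

A ratio of at least 2 for an eccentrically presented target corresponds to the criterion described above for the cell in Figure 10d.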

3.3 Responses to rotations in the view plane 

Neurons were also tested for rotation in the view plane. 
Most units appeared to be orientation selective (Figure 
11b). However, the initial performance of the animal 
also appeared to be orientation dependent for any given 
novel object rotated in the view plane (Figure 11a). In 
almost all cases, however, the initial generalization field 
for picture-plane rotations appears to be broader than 
that typically obtained for rotations in depth (Logothetis 
et al., 1994). Figure 11c illustrates the behavioral 
progression of one animal's recognition performance as 
it evolved from initially view-dependent to almost com- 
pletely view-invariant for two different objects. Gener- 
alization performance often progressed rapidly, over the 
course of a few test sessions, to view-invariant performance. 
This is in strong contrast to the view-dependent 
performance seen for rotations in depth, which changed 
very little for the duration of testing (as many as fifteen 
sessions without feedback). 

4 Discussion 

The results of this study suggest an experience-dependent 
plasticity in IT neurons, and support the idea of 
a population of neurons with configurational selectiv- 
ity being a more general mechanism for encoding com- 
plex, "non-decomposable" objects. The neurons dis- 
cussed above responded selectively to novel objects that 
the monkey had recently learned to recognize. None 
of these objects had any prior meaning for the animal, 
nor did they resemble anything familiar in the monkey's 
environment. View-selective responses were found for 
both object types tested and were not limited to any 
one single region of an object. However, when cells 
were tested with objects that the monkey could recognize 
only from a specific viewpoint, no selective responses 
were ever encountered for views that the animal 
systematically failed to recognize. The reported 
cell responses are unlikely to reflect a general sensa- 
tion of familiarity or arousal, since the majority of the 
neurons responded selectively to a subset of the tested 
object-views, even when the animal's recognition per- 
formance was view-invariant (as in all cases except in 
Figure 11). Thus it seems that neurons in this area may 
develop complex, configurational selectivity as the animal 
is trained to recognize specific objects. Such neurons 
can be regarded as "blurred-templates", the tolerance 
of which to small rotations in depth represents 
a form of limited generalization. The capacity of some 
IT neurons to respond to both an object view and its 
"pseudo-mirror-symmetrical" view can be viewed as a 
broader form of generalization, possibly underlying the 
reflection-invariance observed during the psychophysical 
experiments (Logothetis et al., 1994). Distinguishing 
mirror images has no apparent usefulness to any animal, 
and the inability of normal children to distinguish be- 
tween mirror-symmetrical letters or words (Orton, 1928; 
Corballis and McLaren, 1984) may be an adaptive mode 
of processing visual information, and not a "confusion" 
(Bornstein et al., 1978; Gross and Bornstein, 1978). In 
fact, theoretical and psychophysical work suggests that 
reflection-invariance facilitates the recognition of bilater- 
ally symmetric visual objects (Vetter et al., 1994). Inter- 
estingly, neurons responding to mirror-images of a face 
appear very early in the visual system of the monkey 
(Rodman et al., 1993). 

A significant number of neurons showed response invariance 
to affine image transformations. Similar response 
behavior has been reported earlier for 2D patterns 
like the Fourier descriptors (Schwartz et al., 1983) 
and for faces (Desimone et al., 1984; Rolls and Baylis, 
1986; Tovee et al., 1994). In our sample, position in- 
variance varied from one extreme, where response was 
strongly reduced with small translation (often less than 2 
degrees), to the other extreme where response remained 
largely invariant for eccentricities up to 7.5 degrees. 

Surprising was the degree of view-dependency of the 
Figure 8: Response invariance to changes in size in a view-tuned neuron. The monkey was performing a simple 
fixation task in which each trial lasted 2500ms. PSTHs show the activity of the neuron over the course of a trial. 
The ordinate is spike rate and the abscissa is time. The animal fixated without a stimulus for the first 500ms at which 
point a stimulus would appear (indicated by the dashed line), and it continued to fixate for 2000ms, responding to a 
change in fixation spot color at the end of the trial. Each stimulus is shown to the side of its respective histogram. 
The circled stimulus is the one used for testing view-selectivity. 
Figure 9: Responses to translation of an object in the picture-plane. Data are from the cell presented in Figure 7. 
The activity of the neuron for the default wire presented foveally (shown in Figure 7) is represented here by the 
black histogram in the background of each plot. The gray PSTHs show the activity of the cell for the eight positions 
tested. In each case the center of the wire was translated 7.5 degrees from the central fixation spot. Other than a 
short transient of activity, cell activity is barely distinguishable from baseline when the stimulus is presented at each 
of the eccentric positions. For smaller translations (less than 2 degrees), however, no such position dependence was 
observed. 
Figure 10: A view-selective neuron responding invariantly to changes in size and position. (a) Tuning curve 
showing activity of the neuron for a limited region of the object. The preferred view corresponds to a 120° rotation 
of the object around the Y-axis. (b) The responses of the cell for the ten best distractors. Distractors were always 
presented foveally and at the default size. The best target view was used to examine the cell's response to changes 
in size (c) and position (d). The response of the cell is plotted in both graphs as a ratio of the mean spike rate for a 
target view to the mean of the mean firing rates for the top ten distractors. The bar representing the response to the 
default size is indicated by the asterisk in (c). The smallest size, 1.9°, was used to test translation. The ordinate of 
the graph indicates the position of each test image in terms of its azimuth and elevation. 
Figure 11: View-dependent behavioral performance and view-selective neuronal response for an image rotated in the 
picture-plane. (a) Performance of the animal in terms of hit rate (N = 9 trials per view). In this example, no training 
was given for the zero view prior to testing. (b) The plot depicts the view-tuning curve of the neuron in terms of 
mean spike rate. The abscissa of both plots is rotation angle. (c) Improvement of performance for recognition of 
views resulting from view-plane rotations. The X-axis is rotation angle, the Y-axis increasing session number, and 
the Z-axis hit rate. One test session included ten presentations of each of thirty-six target views, spaced at 
ten-degree intervals. Each curve, starting in the front and proceeding to the back, illustrates the performance over two 
test sessions (N = 20 presentations of each target view). The animal was familiarized with the zero-view of the 
object during one brief training session prior to testing. No feedback was given during the testing periods as to the 
correctness of the response. 
cell and the monkey responses for rotations in the plane 
of view. Psychophysical studies in humans have revealed 
that the recognition of objects rotated in the picture-plane 
is different from the recognition of objects rotated 
in depth. For example, Tarr and Pinker (Tarr and 
Pinker, 1989, 1990; Tarr and Pinker, 1991) studied the 
effects of rotation in the picture plane on recognition and 
found that familiarization with one view of an object re- 
sults in view-independent performance, although reac- 
tion times do increase with deviation from the learned 
view. This performance can be altered by training the 
subjects briefly on a second view, resulting in an im- 
provement in performance around the new learned view 
and to a lesser extent for those views between the two 
familiar views. In our experiments, the behavior of the 
monkeys was initially strongly view-dependent in terms 
of error rate. In contrast to the recognition performance 
observed for rotations of the object in depth, however, 
hit rate for view-plane rotations increased gradually over 
successive sessions without any feedback to the animal 
as to the correctness of its response. No neuron was iso- 
lated long enough to observe any possible changes at the 
single-cell level. 

A question that arises from these results is: are such 
neurons really responding to the "views" of the tested 
objects? Studies by Tanaka and his colleagues (Tanaka 
et al., 1991) showed, for instance, that the response of 
many neurons to complex objects can be mimicked using 
simpler forms representing regions of the objects. In a 
similar vein, the neurons studied here could be respond- 
ing to a reduced set of features of the wire or spheroidal 
objects and not to an entire view. Two observations 
seem to argue against such an alternative. First, the neurons 
were tested with a variety of simple objects, including 
geometric patterns of different orientations, that failed 
to elicit any response. Second, the presentation of be- 
tween 60 and 120 distractors from the same or a different 
object class served as a selectivity-control for each of the 
targets. Thus in the case of the wire-objects, for example, 
given the largely invariant responses of IT neurons 
for small translations (Tovee et al., 1994), the distractors 
had at least 60 different combinations of simple features 
like orientations, angles, or terminations, some of which 
were highly similar to those comprising the target ob- 
ject. As a matter of fact, several cells did respond to the 
presentation of the target and to a number of distrac- 
tor objects, presumably excited by such simpler features. 
However, the selective cells discussed here gave minimal 
and sometimes no response for distractor objects, even 
when the latter shared a few characteristic regions with 
the target, indicating that a specific organization of some 
features was required for eliciting the neuron's response. 

Nevertheless, both arguments are based on qualita- 
tive observations, and what we present here as "view- 
selectivity" may still be reducible to less complex fea- 
ture constellations. A systematic, mathematical analy- 
sis of object-views that elicit similar neural responses, 
and an attempt to develop algorithms for biologically- 
plausible image decomposition may provide an answer 
to the selectivity question, and this is the focus of cur- 
rent experiments. 
5 Conclusions 

Taken together, these data suggest the possibility of a 
recognition architecture similar to that schematically 
described in Figure 1. The discharge rate of many IT 
neurons was found to be a bell-shaped function of orien- 
tation centered on a preferred view. A very small number 
of neurons exhibited object-specific but view-invariant 
responses that might be the result of the convergence of 
view-dependent units into neurons showing characteris- 
tics of object-centered descriptions. The input of each 
view-selective unit can be considered as the conjunction 
of simpler features extracted at earlier stages in the vi- 
sual system. The variability in the degree of response 
invariance during affine image transformations also hints 
at a multilayer, possibly hierarchical architecture. 
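
The suggested convergence of view-dependent units onto a view-invariant unit can be illustrated with a toy calculation. This is a sketch under strong assumptions, not a model fitted to the recordings: each view-tuned unit is taken to be a Gaussian of rotation angle, the pooling is a plain sum, and the spacing of preferred views and the tuning width are hypothetical.

```python
import math


def wrapped_difference(a_deg, b_deg):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a_deg - b_deg + 180.0) % 360.0 - 180.0


def view_tuned_unit(theta_deg, preferred_deg, sigma_deg):
    """Bell-shaped response of one unit centered on its preferred view."""
    d = wrapped_difference(theta_deg, preferred_deg)
    return math.exp(-(d * d) / (2.0 * sigma_deg ** 2))


def pooled_unit(theta_deg, preferred_views, sigma_deg):
    """Assumed convergence rule: sum the outputs of all view-tuned units."""
    return sum(view_tuned_unit(theta_deg, p, sigma_deg) for p in preferred_views)


def flatness(responses):
    """Ratio of weakest to strongest response; 1.0 means fully view-invariant."""
    return min(responses) / max(responses)


test_views = range(0, 360, 5)
preferred_views = range(0, 360, 30)   # hypothetical: one unit every 30 degrees
sigma = 30.0                          # hypothetical tuning width in degrees

single_flatness = flatness([view_tuned_unit(v, 0.0, sigma) for v in test_views])
pooled_flatness = flatness([pooled_unit(v, preferred_views, sigma) for v in test_views])
```

Under these assumptions a single unit responds almost exclusively near its preferred view, while the pooled unit responds nearly equally to every view of the object.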

Such a scheme is obviously oversimplified and lacks 
top-down mechanisms that strongly affect recognition 
performance. The processing of object information is un- 
doubtedly far more complex, and representations might 
be local and explicit or distributed and implicit accord- 
ing to the recognition task or the stimulus context. Al- 
though the ultimate goal of a recognition system is to 
describe grouped object-features in a more abstract for- 
mat that captures the invariant, three-dimensional, geo- 
metric properties of an object, early representations may 
be in some cases strongly configurational. Moreover, for 
visually complex, non-decomposable objects, like many 
biologically meaningful objects, holistic representations 
may be the only ones possible. Neurons selective for 
object-views and tolerant of varying extents of image 
transformations may then be elements of one possible 
mechanism for such representations. 